This is a task to build a retail sales forecasting model using data provided by the Corporación Favorita grocery chain in Ecuador. The data was released for a Kaggle competition. I was given this task as part of my interview process with Coles for the position of Data Scientist.
To complete the task, below is the road map.
Brief Info about Corporación Favorita Grocery
Exploratory Data Analysis
Feature Engineering and Selection
Model Training and Selection
Preparing/Deploying Model for Sales Forecasting
Future works
Corporación Favorita Grocery is a grocery store chain traded under the SLF ticker name. It is a publicly listed corporation based in Pichincha, Quito, Ecuador.
My Exposure to Ecuador
The data was provided by this retail giant to the Kaggle community.
Let me have a crack at this data.
As a first step, I start by looking at the data tables provided. Below are the questions in my mind that I want to answer by examining the various tables.
There are various Kaggle kernels available where people have performed EDA. Shopping for Insights - Favorita EDA looks like the most comprehensive among them. That kernel is written using tidyverse R and ggplot2.
I will be using Plotly Express for my analysis. It is a relatively new Python interactive plotting tool and is part of the Plotly family.
import pandas as pd
import plotly.express as px
import os
import glob
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;
# input data directory
# I have downloaded the data from the Kaggle competition and unzipped it using the 7z tool in an Ubuntu terminal
# TODO: Use the Kaggle API to access the data
data_dir = '/home/ibbu/PycharmProjects/retail_sales_prediction/data/'
files = [f for f in os.listdir(data_dir) if f.endswith('.csv')]
files
# Quickly see the information about the files and sizes to be read
!ls -la ../data/
No surprises: the training data is the largest file.
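Since the training file is by far the largest, it can be worth specifying narrow column dtypes up front when reading it, which cuts memory use considerably. A minimal sketch, where the dtype choices are my assumptions based on the documented value ranges of the columns (the tiny in-memory sample stands in for the real train.csv):

```python
import io
import pandas as pd

# Narrow dtypes for the large training file; uint8/uint32/float32 are
# assumptions based on the documented ranges of each column.
train_dtypes = {
    'store_nbr': 'uint8',
    'item_nbr': 'uint32',
    'unit_sales': 'float32',
}

# Small in-memory sample standing in for train.csv
sample = io.StringIO(
    "date,store_nbr,item_nbr,unit_sales\n"
    "2017-01-01,1,103665,7.0\n"
    "2017-01-01,2,105574,1.0\n"
)
df = pd.read_csv(sample, dtype=train_dtypes, parse_dates=['date'])
print(df.dtypes)
```

The same `dtype=` and `parse_dates=` arguments can be passed when reading the real file from `data_dir`.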
class readDataStore():
    """
    Read the data from the data store.
    In our case, the data store is local and the files are in .csv format.
    """
    def __init__(self, data_dir):
        self.data_dir = data_dir

    def _read(self, file_name):
        # All tables are plain .csv files in the same directory
        return pd.read_csv(os.path.join(self.data_dir, file_name))

    def read_stores(self, file_name):
        return self._read(file_name)

    def read_items(self, file_name):
        return self._read(file_name)

    def read_transactions(self, file_name):
        return self._read(file_name)

    def read_oil(self, file_name):
        return self._read(file_name)

    def read_holidays(self, file_name):
        return self._read(file_name)

    def read_train(self, file_name):
        return self._read(file_name)

    def read_test(self, file_name):
        return self._read(file_name)
readDataStore = readDataStore(data_dir)
df_store = readDataStore.read_stores('stores.csv')
df_store.shape
df_store.describe()
df_store_count = df_store.groupby(['state', 'city', 'type', 'cluster'])['store_nbr'].count()
df_store_count = df_store_count.reset_index().rename(columns = {'store_nbr':'count'})
px.bar(df_store_count, x = 'state', y= 'count', color = 'city', facet_col='type')
px.bar(df_store_count, x = 'cluster', y= 'count', color = 'type')
There is an interesting observation: cluster 10 has three types of stores, while all other clusters contain only one kind of store. Let's dig deeper into what is special about this cluster.
df_store_count[df_store_count['cluster']==10]
df_items = readDataStore.read_items('items.csv')
df_items.shape
df_items_count = df_items.groupby(['family', 'class', 'perishable'])['item_nbr'].count()
df_items_count = df_items_count.reset_index().rename(columns = {'item_nbr':'count'})
df_items_count.head()
px.bar(df_items_count, x = 'family', y= 'count', color = 'class',facet_col='perishable')
px.bar(df_items_count, x = 'family', y= 'count', color = 'perishable')
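The perishable flag matters beyond EDA: in this competition's evaluation metric, perishable items carry a sample weight of 1.25 versus 1.0 for everything else. A quick sketch of deriving those weights from the items table (the toy frame below stands in for the real items.csv):

```python
import pandas as pd

# Toy stand-in for items.csv; the real file has ~4,100 rows
df_items = pd.DataFrame({
    'item_nbr': [96995, 99197, 103501],
    'family': ['GROCERY I', 'GROCERY I', 'CLEANING'],
    'perishable': [0, 1, 0],
})

# Competition scoring weights: 1.25 for perishable items, 1.0 otherwise
df_items['weight'] = df_items['perishable'].map({1: 1.25, 0: 1.0})
print(df_items[['item_nbr', 'weight']])
```

These weights can later be passed to a model's `sample_weight` argument so training reflects the scoring function.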
df_transactions = readDataStore.read_transactions('transactions.csv')
df_transactions.count()
No missing values; date and store number together form the unique key of this table.
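That uniqueness claim can be checked directly rather than assumed, using pandas `duplicated` on the two key columns. A sketch with a toy frame standing in for transactions.csv:

```python
import pandas as pd

# Toy stand-in for transactions.csv
df_transactions = pd.DataFrame({
    'date': ['2013-01-01', '2013-01-01', '2013-01-02'],
    'store_nbr': [1, 2, 1],
    'transactions': [770, 2111, 1503],
})

# True if any (date, store_nbr) pair appears more than once
has_dupes = df_transactions.duplicated(subset=['date', 'store_nbr']).any()
print(has_dupes)  # False -> the pair is a unique key
```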
import datetime
import calendar
df_transactions.head()
df_transactions['date'] = pd.to_datetime(df_transactions['date'])
df_transactions['day_of_week'] = df_transactions['date'].dt.day_name()  # .dt.weekday_name was removed in newer pandas
set(df_transactions['day_of_week'])
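Plotly will otherwise order the day names alphabetically; converting `day_of_week` to an ordered categorical keeps the Monday-to-Sunday order in the plots below. A sketch, with a toy frame standing in for `df_transactions`:

```python
import pandas as pd

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

# Toy frame standing in for df_transactions
df = pd.DataFrame({'day_of_week': ['Sunday', 'Monday', 'Friday']})
df['day_of_week'] = pd.Categorical(df['day_of_week'],
                                   categories=day_order, ordered=True)
print(df.sort_values('day_of_week')['day_of_week'].tolist())
```

Alternatively, the same effect can be had per-plot with Plotly Express's `category_orders` argument.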
# fig = px.bar(df_transactions, x="date", y="transactions", color='store_nbr', barmode="group", facet_row="day_of_week",
#              height=2000,
#              category_orders={"day_of_week": ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']})
# fig.show()
px.violin(df_transactions, y="transactions", x='day_of_week', box=True, points='all')